attention transfer
Model compression using knowledge distillation with integrated gradients
Hernandez, David E., Chang, Jose, Nordling, Torbjörn E. M.
Model compression is critical for deploying deep learning models on resource-constrained devices. We introduce a novel method enhancing knowledge distillation with integrated gradients (IG) as a data augmentation strategy. Our approach overlays IG maps onto input images during training, providing student models with deeper insights into teacher models' decision-making processes. Extensive evaluation on CIFAR-10 demonstrates that our IG-augmented knowledge distillation achieves 92.6% testing accuracy with a 4.1x compression factor, a significant 1.1 percentage point improvement ($p<0.001$) over non-distilled models (91.5%). This compression reduces inference time from 140 ms to 13 ms. Our method precomputes IG maps before training, transforming substantial runtime costs into a one-time preprocessing step. Our comprehensive experiments include: (1) comparisons with attention transfer, revealing complementary benefits when combined with our approach; (2) Monte Carlo simulations confirming statistical robustness; (3) systematic evaluation of compression factor versus accuracy trade-offs across a wide range (2.2x-1122x); and (4) validation on an ImageNet subset aligned with CIFAR-10 classes, demonstrating generalisability beyond the initial dataset. These extensive ablation studies confirm that IG-based knowledge distillation consistently outperforms conventional approaches across varied architectures and compression ratios. Our results establish this framework as a viable compression technique for real-world deployment on edge devices while maintaining competitive accuracy.
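The core mechanics of this abstract, computing integrated gradients and overlaying the resulting saliency map onto an input, can be sketched in a few lines. This is a minimal numpy illustration, not the authors' implementation: the blend weight, the zero baseline, and the toy quadratic model are assumptions made for the example.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=200):
    # Midpoint-rule Riemann approximation of IG along the straight
    # path from `baseline` to `x`.
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

def overlay_ig(x, ig, weight=0.3):
    # Blend a normalized |IG| saliency map onto the input, roughly as
    # the augmentation step is described (blend weight is a guess).
    sal = np.abs(ig)
    sal = sal / (sal.max() + 1e-8)
    return (1.0 - weight) * x + weight * sal

# Toy differentiable "model": f(x) = sum(w * x**2), grad = 2*w*x.
w = np.array([0.5, 1.0, 2.0])
f = lambda x: float(np.sum(w * x**2))
grad_f = lambda x: 2.0 * w * x

x = np.array([1.0, -2.0, 0.5])
baseline = np.zeros_like(x)
ig = integrated_gradients(grad_f, x, baseline)
x_aug = overlay_ig(x, ig)  # IG-augmented input fed to the student

# Sanity check (IG completeness): attributions sum to f(x) - f(baseline).
print(abs(ig.sum() - (f(x) - f(baseline))) < 1e-6)
```

Precomputing `ig` once per training image, as the abstract notes, moves the (expensive) path integral out of the training loop.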
On the Surprising Effectiveness of Attention Transfer for Vision Transformers
Li, Alexander C., Tian, Yuandong, Chen, Beidi, Pathak, Deepak, Chen, Xinlei
Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high-quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher also further improves accuracy on ImageNet. We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning.
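The "distilling the attention maps" variant described above amounts to matching the student's per-head attention distributions to the teacher's. A minimal numpy sketch, assuming a KL objective over pre-softmax attention logits (the loss choice and shapes here are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_distill_loss(student_logits, teacher_logits):
    # Mean KL(teacher || student) over heads and query tokens.
    # Inputs: pre-softmax attention logits, shape (heads, tokens, tokens).
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    kl = np.sum(t * (np.log(t + 1e-9) - np.log(s + 1e-9)), axis=-1)
    return float(np.mean(kl))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8, 8))  # 4 heads, 8 tokens
student = rng.normal(size=(4, 8, 8))

print(attention_distill_loss(teacher, teacher))  # 0.0 when maps match
```

The "copying" variant is even simpler: the student's attention weights are replaced by the teacher's outright, so only the value/MLP paths learn features.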
LoLCATs: On Low-Rank Linearizing of Large Language Models
Zhang, Michael, Arora, Simran, Chalamala, Rahul, Wu, Alan, Spector, Benjamin, Singhal, Aaryan, Ramesh, Krithik, Ré, Christopher
Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves LLM linearizing quality with orders of magnitude less memory and compute. We base these steps on two findings. First, we can replace an LLM's softmax attentions with closely-approximating linear attentions, simply by training the linear attentions to match their softmax counterparts with an output MSE loss ("attention transfer"). Then, this enables adjusting for approximation errors and recovering LLM quality simply with low-rank adaptation (LoRA). LoLCATs significantly improves linearizing quality, training efficiency, and scalability. We significantly reduce the linearizing quality gap and produce state-of-the-art subquadratic LLMs from Llama 3 8B and Mistral 7B v0.1, leading to 20+ points of improvement on 5-shot MMLU. Furthermore, LoLCATs does so with only 0.2% of past methods' model parameters and 0.4% of their training tokens. Finally, we apply LoLCATs to create the first linearized 70B and 405B LLMs (50x larger than prior work). When compared with prior approaches under the same compute budgets, LoLCATs significantly improves linearizing quality, closing the gap between linearized and original Llama 3.1 70B and 405B LLMs by 77.8% and 78.1% on 5-shot MMLU.
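The "attention transfer" step here is an output-MSE between softmax attention and a kernelized linear attention. A hedged single-head numpy sketch; the ReLU feature map below stands in for LoLCATs' learnable feature map and is an assumption of this example:

```python
import numpy as np

def softmax_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    a = np.exp(scores)
    a /= a.sum(axis=-1, keepdims=True)
    return a @ v

def linear_attention(q, k, v, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized attention: phi(q) (phi(k)^T v), normalized.
    # O(n) in sequence length instead of O(n^2).
    qp, kp = phi(q), phi(k)
    num = qp @ (kp.T @ v)
    den = (qp @ kp.sum(axis=0))[:, None]
    return num / den

def attention_transfer_loss(q, k, v):
    # Output MSE that the learnable linear-attention feature map is
    # trained to minimize before the LoRA fix-up stage.
    diff = softmax_attention(q, k, v) - linear_attention(q, k, v)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
loss = attention_transfer_loss(q, k, v)
```

In the actual method this loss is minimized per layer against the frozen LLM's attentions; LoRA then absorbs the residual approximation error.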
Human-to-Robot Attention Transfer for Robot Execution Failure Avoidance Using Stacked Neural Networks
Song, Boyi, Peng, Yuntao, Luo, Ruijiao, Liu, Rui
Due to world dynamics and hardware uncertainty, robots inevitably fail during task execution, leading to undesired or even dangerous behavior. To avoid failures and improve robot performance, it is critical to identify and correct abnormal robot executions at an early stage. However, limited by its reasoning capability and knowledge level, it is challenging for a robot to self-diagnose and correct its own abnormal behavior. To solve this problem, a novel method, human-to-robot attention transfer (H2R-AT), is proposed to seek help from a human. H2R-AT is built on a novel stacked neural network model that transfers human attention, embedded in verbal reminders, to robot attention, embedded in the robot's visual perception. With this attention transfer, a robot understands what and where the human's concerns are, allowing it to identify and correct its abnormal executions. To validate the effectiveness of H2R-AT, two representative task scenarios with abnormal robot executions, "serve water for a human in a kitchen" and "pick up a defective gear in a factory", were designed in the open-access simulation platform V-REP; 252 volunteers were recruited to provide about 12,000 verbal reminders used to train and test the attention transfer model. With an accuracy of 73.68% in transferring attention and an accuracy of 66.86% in avoiding robot execution failures, the effectiveness of H2R-AT was validated.
Moonshine: Distilling with Cheap Convolutions
Crowley, Elliot J., Gray, Gavin, Storkey, Amos J.
Many engineers wish to deploy modern neural networks in memory-limited settings, but the development of flexible methods for reducing memory use is in its infancy, and little is known about the resulting cost-benefit trade-off. We propose structural model distillation for memory reduction using a strategy that produces a student architecture that is a simple transformation of the teacher architecture: no redesign is needed, and the same hyperparameters can be used. Using attention transfer, we provide Pareto curves/tables for distillation of residual networks with four benchmark datasets, indicating the memory versus accuracy payoff. We show that substantial memory savings are possible with very little loss of accuracy, and confirm that distillation provides student network performance that is better than training that student architecture directly on data.
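The attention transfer used here for convolutional networks follows the activation-based formulation: each feature tensor is reduced to a normalized spatial attention map, and the student is penalized for deviating from the teacher's map. A minimal numpy sketch (shapes and the channel-squared reduction are the standard choices, assumed here):

```python
import numpy as np

def spatial_attention(feat):
    # Sum of squared activations over channels -> (H, W) map,
    # flattened and L2-normalized. Works for any channel count, so
    # teacher and student need not share widths.
    a = (feat ** 2).sum(axis=0).ravel()
    return a / (np.linalg.norm(a) + 1e-8)

def attention_transfer_loss(student_feat, teacher_feat):
    # L2 distance between normalized attention maps; spatial sizes
    # must match, channel counts may differ.
    return float(np.linalg.norm(
        spatial_attention(student_feat) - spatial_attention(teacher_feat)))

rng = np.random.default_rng(2)
teacher = rng.normal(size=(64, 8, 8))  # (C, H, W)
student = rng.normal(size=(16, 8, 8))  # cheaper student, fewer channels

print(attention_transfer_loss(teacher, teacher))  # 0.0 for identical features
```

Because the map discards the channel dimension, this loss lets the cheap-convolution students in the paper match the teacher's spatial focus despite having far fewer parameters.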